1 Introduction

Team 010100 consists of the following members: Obumneke Amadi, Izzy Illari, Lucia Illari, Omar Qusous, and Lydia Teinfalt. You may find our work over on GitHub.

With the 2020 Olympics beginning this July in Tokyo, we felt that a relevant discussion to have would be: What makes an Olympian? What can we say about Olympians? Have there been any general trends amongst Olympians? What does the Olympic population look like? These questions are all suited to EDA, and with them in mind we went looking for readily available data on Olympians to analyze. Eventually our question morphed into the following: are there any specific characteristics (e.g. age, weight, height, BMI, country of origin) that could be used to describe an Olympian in general?

We were able to find a dataset called 120 years of Olympic history: athletes and results on Kaggle: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results. This historical dataset covers all Olympic Games from Athens 1896 to Rio 2016 and was scraped from https://www.sports-reference.com/. The data was compiled by a group of Olympic historians and statisticians. All of these individuals are members of the International Society of Olympic Historians (ISOH) and have been working on this project since the late 1990s.

The report is organized as follows:

  1. Summary of Dataset
  2. Description of Data
  3. Are there any common names among Olympians? (Name Data)
  4. Where do Olympians come from? (Geographical Data)
  5. Is the “Age”, “Weight”, and “Height” data normally distributed? (Normality Test of Numerical Columns)
  6. What are the body types of Olympic Athletes? (BMI Data)
  7. Does a country’s GDP and Population affect the number of medals that its athletes win? (GDP Data)
  8. Have there been any body (weight and height) trends among Olympians over the years? (Trends Over Time)
  9. Are Olympic Athletes a certain age? (Age Data)

2 Summary of Dataset

The data looks like the following:

'data.frame':   271116 obs. of  15 variables:
 $ ID    : int  1 2 3 4 5 5 5 5 5 5 ...
 $ Name  : Factor w/ 134732 levels "  Gabrielle Marie \"Gabby\" Adcock (White-)",..: 8 9 44318 29412 21470 21470 21470 21470 21470 21470 ...
 $ Sex   : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ...
 $ Age   : int  24 23 24 34 21 21 25 25 27 27 ...
 $ Height: int  180 170 NA NA 185 185 185 185 185 185 ...
 $ Weight: num  80 60 NA NA 82 82 82 82 82 82 ...
 $ Team  : Factor w/ 1184 levels "30. Februar",..: 199 199 273 278 705 705 705 705 705 705 ...
 $ NOC   : Factor w/ 230 levels "AFG","AHO","ALB",..: 42 42 56 56 146 146 146 146 146 146 ...
 $ Games : Factor w/ 51 levels "1896 Summer",..: 38 49 7 2 37 37 39 39 40 40 ...
 $ Year  : int  1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ...
 $ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ...
 $ City  : Factor w/ 42 levels "Albertville",..: 6 18 3 27 9 9 1 1 17 17 ...
 $ Sport : Factor w/ 66 levels "Aeronautics",..: 9 33 25 62 54 54 54 54 54 54 ...
 $ Event : Factor w/ 765 levels "Aeronautics Mixed Aeronautics",..: 160 398 349 710 623 619 623 619 623 619 ...
 $ Medal : Factor w/ 3 levels "Bronze","Gold",..: NA NA NA 2 NA NA NA NA NA NA ...

The athlete events data has 15 columns and 271116 rows/entries, for a total of 4066740 individual data points. In athlete_events each row corresponds to an individual athlete competing in an individual Olympic event. The variables are the following:

  1. ID: Unique number for each athlete
  2. Name: Athlete’s name
  3. Sex: M or F
  4. Age: Integer
  5. Height: centimeters
  6. Weight: kilograms
  7. Team: Team name
  8. NOC: National Olympic Committee 3-letter code
  9. Games: Year and season
  10. Year: Integer
  11. Season: Summer or Winter
  12. City: Host city
  13. Sport
  14. Event
  15. Medal: Gold, Silver, Bronze, or NA

To prepare our data for EDA we dropped the Olympic event: Art Sculpting. NAs were also removed.

3 Description of Data

We can look at the top sports by number of athletes participating in them. We can show this in a table or in a bar chart.

Sport freq
4 Athletics 3648
44 Swimming 2486
33 Rowing 2104
26 Ice Hockey 1301
25 Hockey 1168
23 Gymnastics 1161
18 Fencing 1109
20 Football 1084
12 Canoeing 1041
7 Basketball 1000
55 Wrestling 967
52 Volleyball 958
24 Handball 937
15 Cycling 845
53 Water Polo 764

4 Name Data

We have name data, so it might be interesting to take a look at it. Names carry implicit connotations regarding gender and nationality, and a word cloud could be a good way to visualize this to get a general sense of “who is going to the Olympics”. Word clouds, in this way, are suited for exploratory qualitative analysis for a column of data that would typically be ignored.

First things first: for the name data, we wanted to look at only the first instance of someone going to the Olympics. Some individuals participated in 30+ Olympic events in their lifetime, and we wanted to avoid counting each one of those instances in the name counts. This was done by looking at the “ID” column, as a unique ID is assigned to every athlete that participates in the Olympics. Then, we can get the first and the last names from this new list, which contains only the first time each athlete attended an Olympic event.

Out of curiosity, we can check who has the longest name ever to go to the Olympics. That name is 108 characters long and belongs to Prince Max Emanuel Maria Alexander Vicot Bruno de la Santisima Trinidad y Todos los Santos von Hohenlohe Langenburg, who participated in the 1956 Winter Games in Alpine Skiing Men’s Downhill. His Wiki page is roughly 4 paragraphs long, and apparently he didn’t do much of interest in his life other than go to the Olympics once and be the uncle of someone who is, apparently, a pop singer and has also participated in many more Olympic Games.

First, we take a look at the first and last names. For the last names, a lot of “Jr.”, “Sr.”, and “III” entries popped up, and these were excluded from the counts, as they aren’t actually last names. If these hadn’t been removed, the most frequent “last” name would be “Jr.”!
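The two steps described above (keeping only each athlete's first row, then counting name tokens while skipping suffixes) can be sketched as follows. The output shown earlier comes from R; this is a Python illustration only, the rows are hypothetical, and treating the token before a suffix as the surname is our assumption about how the exclusion might work.

```python
from collections import Counter

# Hypothetical (ID, Name) rows. The repeated ID mimics an athlete who
# appears in several events; the names themselves are made up.
rows = [
    (1, "John Smith"),
    (2, "Maria Garcia"),
    (5, "Anna Lee"),
    (5, "Anna Lee"),
    (6, "John Brown Jr."),
]

# Generational suffixes that are not surnames (assumed list).
SUFFIXES = {"jr.", "sr.", "iii"}


def first_appearances(rows):
    """Keep only the name from the first row seen for each athlete ID."""
    seen = {}
    for athlete_id, name in rows:
        seen.setdefault(athlete_id, name)
    return list(seen.values())


def name_counts(names):
    """Count first and last name tokens, skipping generational suffixes."""
    first, last = Counter(), Counter()
    for name in names:
        tokens = [t for t in name.split() if t.lower() not in SUFFIXES]
        if not tokens:
            continue
        first[tokens[0]] += 1
        if len(tokens) > 1:
            last[tokens[-1]] += 1
    return first, last


unique_names = first_appearances(rows)
first, last = name_counts(unique_names)
print(first.most_common(3))
print(last.most_common(3))
```

The resulting counters are the kind of frequency table a word cloud would be drawn from.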

We see here that most of the first names appear to be men’s names, but the last names are interesting in that there appear to have been a lot of people with names that might be classified as Hispanic or Latino. We also notice that the names appear to be missing some vowels. However, since this is a word cloud created on only the first instances of the athletes, I don’t think we can brush this off as a one-off typo or anything like that. It’s possible that when recording names of athletes, vowels were dropped to make the names shorter and/or easier to record. It’s hard to know the exact reason for this. Clearly, however, the main take-away from these clouds is: if you want to be an Olympic athlete, you should think about changing your name to John Smith. It won’t guarantee that you’ll medal, however.

Next, we look at the first names by sex:

What should strike us is that the word cloud of first names for men looks pretty much exactly like the word cloud for all first names, regardless of sex. Why is this?

The simple answer is this: only 27.487% of all athletes that have gone to the Olympics have been women. That’s a drastic difference! There is definitely room here to analyze how the participation of female athletes has increased over time, which countries have sent the most women, which countries were the first to send women, etc., but there wasn’t enough time to perform this analysis. Simply looking at the word clouds, though, provided some qualitative insight into the data.

5 Geographical Data

We have all this geographic data concerning where Olympic athletes are coming from, or rather, the Teams for which they play. We can make two maps here for the global geographic plot: one where we take only the first instance of an Olympic athlete, to disregard the repeat appearances of athletes that participated in 30+ events, and one where we build the heat map for all athletes, regardless of the number of events they went to. What I mean is, each time an athlete participates in an event, that is recorded in a new row with all the relevant information, so an athlete can potentially be counted multiple times if they have attended multiple events. The first table records the counts for only the first event an athlete attends. The second table is for all events.

Counts of olympic athletes (first event only)
NOC count country percentage
USA 9617 USA 7.09
GBR 6272 UK 4.63
FRA 6150 France 4.54
ITA 4916 Italy 3.63
CAN 4791 Canada 3.53
GER 4617 Germany 3.41
JPN 4066 Japan 3.00
AUS 3795 Australia 2.80
SWE 3785 Sweden 2.79
POL 2966 Poland 2.19
Counts of olympic athletes (all events)
NOC count country percentage
USA 18853 USA 6.95
FRA 12758 France 4.71
GBR 12256 UK 4.52
ITA 10715 Italy 3.95
GER 9830 Germany 3.63
CAN 9733 Canada 3.59
JPN 8444 Japan 3.12
SWE 8339 Sweden 3.08
AUS 7638 Australia 2.82
HUN 6607 Hungary 2.44

We see that there isn’t much difference between the two heat plots! What’s affected is the total count, but we can see in the tables, where counts have been converted to percentages, that the percentages are roughly the same. For example, the USA has sent about 7% of all athletes that attend Olympic events.
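The difference between the two tables is only whether repeat rows for the same athlete are counted. A minimal Python sketch of that idea, using hypothetical (ID, NOC) rows rather than the real data, and not the R code behind the tables above:

```python
from collections import Counter

# Hypothetical (ID, NOC) rows: one row per athlete-event, as in the dataset.
rows = [(1, "CHN"), (2, "CHN"), (5, "NED"), (5, "NED"), (5, "NED"), (6, "USA")]


def noc_shares(rows, first_only):
    """Return {NOC: (count, percentage)}, optionally counting each athlete once."""
    seen = set()
    counts = Counter()
    for athlete_id, noc in rows:
        if first_only:
            if athlete_id in seen:
                continue  # skip repeat appearances of the same athlete
            seen.add(athlete_id)
        counts[noc] += 1
    total = sum(counts.values())
    return {noc: (n, round(100 * n / total, 2)) for noc, n in counts.most_common()}


print(noc_shares(rows, first_only=True))
print(noc_shares(rows, first_only=False))
```

In this toy data the shares change a lot because one athlete dominates the rows; in the real tables the percentages barely move between the two versions.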

Sports-Reference.com, where this data initially came from, has an interesting “frivolities” section, where I noticed you can find the birthplaces of Olympic athletes. What I decided to do was scrape together some population data from the US Census, and then collect the number of athletes born in the first 100 cities that show up on the census, ordered by population size (i.e., NYC, the city with the largest population, is ranked 1 on the census). Interestingly, just being born in the US did not mean the athlete would play for the US team. For example, there was a person born in NYC that was on the UK team.

rank name state num.athletes pop2020 pop2010 change density latlng aland
1 New York New York 416 8622357 8175133 0.002 11084 40.663468,-73.938697 777934030
3 Chicago Illinois 276 2670406 2695598 -0.003 4535 41.837551,-87.681844 588808397
2 Los Angeles California 212 4085014 3792621 0.007 3365 34.019394,-118.410825 1213820883
7 Philadelphia Pennsylvania 206 1579504 1526006 0.002 4545 40.009376,-75.133346 347520038
69 St. Louis Missouri 160 297520 319294 -0.012 1854 38.635699,-90.244582 160462911
21 Boston Massachusetts 146 701984 617594 0.010 5606 42.338551,-71.018253 125209492
14 San Francisco California 111 906419 805235 0.010 7461 37.727239,-123.032229 121485107
46 Minneapolis Minnesota 106 427791 382578 0.008 3059 44.963324,-93.26832 139861493
20 Washington Dc District of Columbia 105 724342 601723 0.015 4574 38.904103,-77.017229 158351639
18 Seattle Washington 98 787740 608660 0.027 3628 47.619349,-122.351471 217128564

We see something interesting: while it makes sense that NYC would come up first for both US population and Olympic athlete births, St. Louis comes up fifth in number of Olympic athlete births while being ranked 69th by US population. Who knows what’s going on in Missouri that they’re sending so many athletes to the Olympics!

6 Normality Test of Numerical Columns

We can perform some traditional analysis of the numerical columns for which it makes sense. While “ID” and “Year” are indeed numerical, it doesn’t make much sense to check them for normality. This leaves “Age”, “Height”, and “Weight”. We can first look at this data by comparing the results for men and for women against each other and then plotting:

Descriptive statistics for weight data
Sex Mean Variance Standard.Deviation
F 61.3 108 10.4
M 76.5 188 13.7

Descriptive statistics for height data
Sex Mean Variance Standard.Deviation
F 169 72.4 8.51
M 179 89.5 9.46

Descriptive statistics for age data
Sex Mean Variance Standard.Deviation
F 22.9 26.5 5.14
M 25.0 33.4 5.78

Now, we can look at the “Age”, “Weight”, and “Height” data in totality, and check normality. To do so, we use two modified versions of the outlierKD2 function originally written by Klodian Dhana. One is edited to include the normal distribution on the histogram as well as a QQ-plot, and the other performs the Lilliefors test for normality on the data with and without outliers.

The choice to use the Lilliefors test was made because the Shapiro-Wilk test requires a sample size between 3 and 5000, which is clearly exceeded here, and the Kolmogorov-Smirnov test complains that ties should not be present. More importantly, the Kolmogorov-Smirnov test is not a test for general normality but, instead, a test against a fully specified distribution. If \(\mu\) and \(\sigma\) are instead estimated from the data, the p-values it reports will be nonsense. The Lilliefors test (based on the Kolmogorov–Smirnov test) is a test for the composite hypothesis of normality: its null hypothesis is that the data come from a normally distributed population, without specifying which normal distribution, i.e., without specifying the expected value and variance. Thus, we employ it here.
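To make the idea concrete, here is a minimal standard-library Python sketch of a Lilliefors-style test. It is not the R routine that produced the outputs in this section: the statistic is the usual Kolmogorov-Smirnov distance, but the normal distribution is fitted from the sample itself, and the p-value is approximated by Monte Carlo simulation under the null rather than taken from tables. The function names, the simulation size, and the skewed demo sample are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def lilliefors_d(sample):
    """KS distance between the empirical CDF and a normal CDF whose
    mean and standard deviation are estimated from the same sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare against the ECDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def lilliefors_test(sample, n_sim=500, seed=0):
    """Return (D, Monte Carlo p-value) for the composite normality hypothesis."""
    rng = random.Random(seed)
    n = len(sample)
    observed = lilliefors_d(sample)
    # D does not depend on location or scale, so simulating from a
    # standard normal is enough to approximate its null distribution.
    hits = sum(
        lilliefors_d([rng.gauss(0.0, 1.0) for _ in range(n)]) >= observed
        for _ in range(n_sim)
    )
    return observed, (hits + 1) / (n_sim + 1)


# Deterministic, strongly right-skewed demo sample (exponential quantiles).
skewed = [-math.log(1 - (i + 0.5) / 100) for i in range(100)]
d_stat, p_value = lilliefors_test(skewed)
print(d_stat, p_value)
```

Because the parameters are re-estimated inside every simulated sample, the reference distribution accounts for the estimation step, which is exactly what the plain Kolmogorov-Smirnov test fails to do.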

The result of the Lilliefors test for weight data with outliers is:

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  var_name
D = 0.06, p-value <0.0000000000000002

The result of the Lilliefors test for weight data without outliers is:

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  var_name
D = 0.04, p-value <0.0000000000000002

The result of the Lilliefors test for height data with outliers is:

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  var_name
D = 0.03, p-value <0.0000000000000002

The result of the Lilliefors test for height data without outliers is:

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  var_name
D = 0.04, p-value <0.0000000000000002

The result of the Lilliefors test for age data with outliers is:

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  var_name
D = 0.1, p-value <0.0000000000000002

The result of the Lilliefors test for age data without outliers is:

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  var_name
D = 0.09, p-value <0.0000000000000002

Looking at those \(p\)-values, we see that in every case we can reject the null hypothesis that the data come from a normally distributed population, as each \(p\)-value is less than our significance level \(\alpha=0.05\).

7 BMI Data

Body Mass Index (BMI) is used to group individuals into weight categories, some of which are associated with health problems. Olympic athletes represent top physical fitness, so we wanted to see whether Olympians have a healthy weight according to the CDC.

BMI is calculated as an athlete’s weight in kilograms divided by the square of their height in meters. The dataset stores height in centimeters, so the formula was modified to convert height to meters. The resulting number can be used to classify athletes older than 20 into one of these groups: underweight, normal or healthy weight, overweight, and obese.

Weight.Category BMI
Underweight Less than 18.5
Healthy Weight Between 18.5 and 24.9
Overweight Between 25 and 29.9
Obese 30 or greater
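The calculation and the category table can be sketched in a few lines of Python. This is an illustration rather than the code used for the report; the function names are ours, and treating the category boundaries as half-open intervals (so that a BMI between 24.9 and 25 counts as healthy and exactly 30 counts as obese) is an assumption.

```python
def bmi(weight_kg, height_cm):
    """Weight in kilograms divided by the square of height in meters."""
    height_m = height_cm / 100  # the dataset stores height in centimeters
    return weight_kg / height_m ** 2


def weight_category(bmi_value):
    """Map a BMI value onto the four categories in the table above."""
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25:
        return "Healthy Weight"
    if bmi_value < 30:
        return "Overweight"
    return "Obese"


# First row of the dataset summary: Height 180 cm, Weight 80 kg.
value = bmi(80, 180)
print(round(value, 1), weight_category(value))
```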

7.1 BMI Boxplots of Athletes’ Height, Weight and Age

The chart comparing height against weight category shows that the averages for female and male athletes are not equal in any category. It does not matter whether the Olympian is underweight or obese: the male athletes are on average taller than the female athletes. For healthy and overweight athletes, there are more outliers in the data than in the underweight and obese categories. The weight chart shows that average weight increases from underweight to obese, which is what we would expect. There is very little variability in weight for underweight Olympians and much greater variability in weight for obese athletes. As we found with height, the male athletes on average weigh more than the female athletes.

Average ages are close irrespective of weight category or gender. Notable are the number of outliers in age data for healthy and overweight Olympians.

7.2 Boxplot of All Olympic Athletes BMI

The histogram of relative frequency versus BMI of female and male Olympians clearly shows that not all athletes have a healthy weight according to the CDC.

7.3 Boxplot of BMI vs Sport (Summer)

We wondered about the differences between Olympians who competed in summer versus winter events.

In the summer events, the boxplots show that athletes in the overweight category competed in basketball, boxing, football, ice hockey, rugby, shooting, tug-of-war, weightlifting, and wrestling. There was great variability in BMI for weightlifters; those at the top of the range were considered obese. In the underweight category were female rhythmic gymnasts and gymnasts.

7.4 Boxplot of BMI vs Sport (Winter)

In the winter events, the boxplots showed that male athletes in the overweight category competed in alpine skiing, bobsleigh, curling, freestyle skiing, ice hockey, luge, and snowboarding. The female figure skaters were in the underweight category. Given that there are not that many events in the Winter Olympics, but athletes in half of them range from the upper end of healthy weight to overweight, it seems that more Winter Olympians fall in that range. This may reflect the nature of winter events, which may require athletes to be bulkier and bigger in order to be competitive.

7.5 Boxplot of BMI vs Top 10 Events

The boxplot showing BMI data for the top 10 events with the most athletes is easier to read and confirms that for events such as basketball, handball, ice hockey, water polo, and wrestling the male athletes are considered overweight by CDC standards. Gymnastics sits at the lower end of the weight categorization.

7.6 BMI of Medal Winning Athletes

We looked at sporting events in order to learn about summer versus winter athletes and found there are more overweight winter athletes. The events played a strong role in determining the BMI classification of the athletes. We wanted to find out whether medal winners also share common characteristics. Did Gold/Silver/Bronze medalists have more in common with each other irrespective of the events they competed in?

The boxplot and histograms of BMI data for Gold/Silver/Bronze medal winners looked relatively normal.

We created a scatterplot of height and weight of athletes that competed in the top 10 events to discern groupings based on event but did not find any conclusive visual evidence.

7.7 US Olympic Team

We have looked at data from the perspective of season, events and medals. Let’s focus on US Olympic Team data in table format to see if we can draw any conclusions.

Gymnasts on the US Olympic team are at the lower end of the healthy weight range. Wrestlers are the second shortest but, together with ice hockey players, have the highest BMI, which classifies them as overweight. This is consistent with the analysis in previous sections, where wrestlers and ice hockey players were also classified as overweight.

Sport Avg Height (m) Avg Weight (kg) BMI
Gymnastics 1.64 59.4 21.9
Wrestling 1.73 76.8 25.2
Hockey 1.74 69.3 22.9
Football 1.75 70.8 22.9
Athletics 1.78 71.8 22.5
Cycling 1.78 73.2 23.1
Fencing 1.78 72.3 22.7
Ice Hockey 1.79 81.2 25.2
Canoeing 1.80 78.0 24.1
Handball 1.83 81.5 24.0
Swimming 1.83 76.2 22.5
Rowing 1.85 81.2 23.6
Water Polo 1.86 85.8 24.8
Volleyball 1.87 79.8 22.7
Basketball 1.92 86.8 23.3

7.8 T-Intervals 95% confidence

Variable Gender level 0.95
BMI Sex = M [23.982, 24.064]
BMI Sex = F [21.802, 21.906]

7.9 T-Test

The reported p-value, 2.2e-16, is small enough to be treated as 0, so using a two-sided test we can reject the null hypothesis that the mean BMI values of male and female Olympians are equal. The plots indicate that the BMI of male Olympians is greater than that of female Olympians. The mean BMIs of female and male athletes are 21.9 and 24.0, respectively.

Variable Gender Average
BMI Sex = F [21.9]
BMI Sex = M [24]
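As a rough illustration of the interval and test logic in the last two subsections, the Python sketch below uses a large-sample normal approximation in place of Student's t quantiles, which is reasonable only because the groups here contain thousands of athletes. It is not the R procedure that produced the numbers above; the function names and the two small demo samples are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_interval(sample, level=0.95):
    """Large-sample confidence interval for a mean (normal quantile, not t)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return centre - half_width, centre + half_width


def welch_statistic(a, b):
    """Difference in means divided by its unpooled standard error."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se


def two_sided_p(statistic):
    """Two-sided p-value under the normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(statistic)))


# Hypothetical BMI samples, far smaller than the real groups.
men = [24.0, 25.0, 23.0, 26.0, 22.0]
women = [21.0, 22.0, 20.0, 23.0, 19.0]
stat = welch_statistic(men, women)
print(mean_interval(men), stat, two_sided_p(stat))
```

With samples as small as the demo lists, a proper t distribution should be used instead; the sketch only shows the structure of the calculation.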

7.10 Anova Test comparing BMI Averages of Athletes From Different Olympic Sporting Events

               Df Sum Sq Mean Sq F value              Pr(>F)    
Sport          14  16916    1208     205 <0.0000000000000002 ***
Residuals   18632 109780       6                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value confirms that there are significant differences in the BMI averages of athletes across sports, which agrees with the plots from sections 7.3-7.4.
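The F value in the ANOVA table is the ratio of the two mean squares, each being a sum of squares divided by its degrees of freedom. A short Python sketch (an illustration, not the R call that produced the table; the function names and the toy groups are ours) computes the same quantity from raw groups and rechecks the table's arithmetic:

```python
from statistics import mean


def f_from_sums(ss_between, df_between, ss_within, df_within):
    """F = (between-group mean square) / (within-group mean square)."""
    return (ss_between / df_between) / (ss_within / df_within)


def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return df_between, df_within, f_from_sums(
        ss_between, df_between, ss_within, df_within
    )


# Rechecking the table: Sum Sq 16916 on 14 df, residual 109780 on 18632 df.
table_f = f_from_sums(16916, 14, 109780, 18632)
print(round(table_f))
```

The 14 degrees of freedom for Sport correspond to 15 sports being compared.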

Many Olympic athletes’ builds are more muscular than the average person’s. They may have BMIs that are technically categorized as overweight, but this is not necessarily unhealthy. Many BMI charts point out that BMI improperly categorizes athletes, and our analysis is consistent with this.

8 GDP Data

We wanted to answer the question: does the GDP and population of a country affect the number of medals it wins, and how does it affect the sports it excels in? To find an answer, we explore the wealth and population of each country participating in the Olympics in order to build a macro-level insight into how an athlete’s environment may assist or hinder them in winning medals. The economic growth and population numbers of each country are compared in order to ascertain how they may affect the athlete’s ability to win medals.

8.1 Additional Datasets

The Olympics dataset obtained previously is merged with datasets containing the Gross Domestic Product (GDP) and population of each country for the years 1960-2019. The data sources were:

Population:

https://knoema.com/UNWPP2019/world-population-prospects-2019

(obtained from worldbank.org)

GDP:

https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

Both datasets had missing data for several of the past decades and for some countries that did not participate in the Olympics. Also, for recent years some of the countries that did participate still had missing data. These gaps were filled using the average of the years that preceded/followed the missing data. Furthermore, gross domestic product per capita (GDPpC) was calculated by dividing the GDP by the population. Categorical variables of the data are converted into the ‘factor’ class.
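One possible reading of the gap-filling rule is sketched below in Python: a missing year takes the average of the nearest known values before and after it, or the single available neighbour at either end of the series. This interpretation, the function names, and the example figures are assumptions made for illustration; the exact rule used in the analysis is not spelled out beyond the description above.

```python
def fill_gaps(series):
    """Fill None entries in a {year: value} mapping from neighbouring years."""
    years = sorted(series)
    filled = dict(series)
    for i, year in enumerate(years):
        if series[year] is not None:
            continue
        # Nearest known value before and after the gap, if any.
        before = next(
            (series[y] for y in reversed(years[:i]) if series[y] is not None), None
        )
        after = next(
            (series[y] for y in years[i + 1:] if series[y] is not None), None
        )
        neighbours = [v for v in (before, after) if v is not None]
        if neighbours:
            filled[year] = sum(neighbours) / len(neighbours)
    return filled


def gdp_per_capita(gdp, population):
    """GDPpC is GDP divided by population."""
    return gdp / population


# Hypothetical GDP series with two missing years.
gdp = fill_gaps({2000: 10.0, 2001: None, 2002: 20.0, 2003: None})
print(gdp, gdp_per_capita(gdp[2001], 5.0))
```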

8.2 EDA

As a first step to the EDA, for the year 2000 until present, bar charts of the total number of medals won by each country and gross domestic product per capita (GDPpC) were plotted for preliminary comparison.

Looking at a few randomly picked countries, the following is observed:

Country n GDPpC.rank Population.rank
Luxembourg 0 1st 169th
Norway 216 3rd 119th
USA 1626 7th 3rd
Russia 933 61st 9th
Jordan 1 109th 88th
Ukraine 161 119th 35th

The table does not directly show a clear and distinct relationship between countries winning medals and their GDP/population. Scatter plots, histograms and pair plots (pairs.panels shows bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal) are shown below to further investigate the relationship. To simplify the plots, the average GDP, population and GDP per capita are calculated for the years from 2000 until present. Also, outliers were identified using “boxplot.stats”, and the plots are split into “with outliers” and “without outliers” versions.
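For readers unfamiliar with the rule behind “boxplot.stats”, the Python sketch below mirrors our understanding of its default behaviour: points more than 1.5 times the hinge spread beyond the lower or upper Tukey hinge are flagged as outliers. It is an illustration written for this report, not the R function itself, and the example values are hypothetical.

```python
from math import ceil, floor


def tukey_hinges(values):
    """Lower and upper hinges, following the five-number-summary convention."""
    xs = sorted(values)
    n = len(xs)
    n4 = floor((n + 3) / 2) / 2  # 1-based hinge position, may end in .5

    def at(position):
        # Average the two neighbouring order statistics for .5 positions.
        return 0.5 * (xs[floor(position) - 1] + xs[ceil(position) - 1])

    return at(n4), at(n + 1 - n4)


def split_outliers(values, coef=1.5):
    """Return (inliers, outliers) using fences at coef times the hinge spread."""
    lower, upper = tukey_hinges(values)
    spread = upper - lower
    low_fence, high_fence = lower - coef * spread, upper + coef * spread
    inliers = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return inliers, outliers


# Hypothetical medal counts with one very large value.
inliers, outliers = split_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(outliers)
```

Splitting the data this way gives the two versions of each plot: one drawn from all values and one from the inliers only.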

8.3 Relationship between Medals Won and Competing Country’s GDP per Capita (year 2000-present)

The scatter plots show a positive trend. The pair plots also show a stronger correlation between average GDP and number of medals won than between medals won and population or GDPpC. The table below summarizes the numbers.

Pearsons cf Ave population Ave GDPpC Ave GDP
Medals won 0.33 0.26 0.85
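The Pearson coefficients in this table measure how closely two variables follow a straight-line relationship. A minimal Python sketch of the formula (covariance divided by the product of the standard deviations), with hypothetical inputs rather than the report's data:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical average GDP values and medal counts for four countries.
avg_gdp = [1.0, 2.0, 4.0, 8.0]
medals = [3, 5, 9, 17]
print(pearson(avg_gdp, medals))
```

A value near 1 indicates a strong positive linear relationship, a value near 0 little linear relationship, and a negative value a decreasing one.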

8.4 Relationship Between Sports and Competing Country’s GDP per Capita (year 2000-present)

Another relationship to explore is that between the number of medals won in a particular sport and the GDP, population and GDPpC of the competing countries. The plots below show scatter and pair plots for three sports, each exhibiting a different relationship to population, GDP and GDPpC.

The results showed that each sport has a different relationship with the GDPpC of the relevant countries: swimming has a positive trend with GDPpC, boxing is roughly neutral, and canoeing has a negative trend. The table below summarizes the calculated correlation coefficients for the three sports.

Number of medals won for sport: Ave population Ave GDP Ave GDPpC
Swimming 0.21 0.83 0.33
Boxing 0.09 0.18 0.04
Canoeing -0.17 -0.23 -0.16

The Residuals vs Leverage plots below show that none of the datapoints are highly influential against a regression line.

8.5 Additional Information

The results of this investigation could be further improved if data on the percentage of GDP that each country allocates to its Olympic Committee were included. Also, the World Happiness Report, Corruption Perceptions Index, life expectancy and literacy index could all be included to develop a well-rounded look into the various factors that may affect an athlete’s performance due to the environment they lived in.

10 Age Data

10.1 Plots

Medal mean
Gold 25.9
Silver 26.0
Bronze 25.9

It appears that the mean age at which an Olympic medalist wins a medal is around 26 years old. We can also look at the ages of medal-winning athletes separated by the Summer and Winter Games.

From the plots we can see that between WW1 and WW2 the average age of medalists is decreasing, but after WW2 the average age temporarily rose. We see that the age begins to decrease until 1980 but then rises again after 1980. The age seems to plateau in the 2010s.

It seems that there are fewer peaks and dips in the Winter Games data than in the Summer Games data, where the Winter Athletes seem to have a smaller variance in age. We can look at the Summer and Winter Games together.

It appears that after the 1950s the athletes at the Winter Games, on average, are older. Both Summer and Winter Games experience an upward trend in ages after the 1980s.

We can also look at the breakdown of ages between Season and Gender.

When we look at the medal-winning athletes during the Summer Games by gender, we see that in general men win medals at older ages than women do. Just as with the Summer Games, in the Winter Games we see that, on average, male athletes tend to be older than female athletes.

10.2 Summary of changes over time in ages of olympic athletes

While the weights and heights of the athletes seem to be affected by the event in which they participate (or, rather, athletes with extreme body types perhaps go to the Olympics more than individuals with “average” body types), the ages of the Olympic athletes seem to be affected by global events, which makes sense considering that the Olympics are, in and of themselves, global events. The overall average age of all Olympians, even amongst the medal winners, is roughly 26 years old.

During the two World Wars the average ages of the athletes are higher than in other periods, presumably because the younger athletes were fighting in the wars. A good three quarters of the Olympic athletes have been men, and so it makes sense to see this shift in the mean age when we suddenly lose a good majority of the young male athletes to war. After WW2 we see a decrease in the average age of Olympians, indicating that we have a new group of young athletes participating in the games. Since roughly the 1980s, there has been a slow increase in the average ages of Olympians. On average female athletes have fairly consistently been younger than their male counterparts, and the Olympians participating in the Winter games are younger than the Olympians participating in the Summer games.